Podcast Reviews Data Analysis

Table of contents

  • Project Brief
  • Business Stakeholders and Objectives
    • Stakeholder Identification
    • Stakeholder Objectives
  • Notebook Initialization and Exploratory Data Analysis (EDA)
    • Setting up the Coding Environment
      • Importing Libraries
      • Importing Functions
    • Loading the Data
      • Downloading Database
      • Data Brief
      • Cleaning the Data
    • Data Exploration Overview
      • Preliminary Plan for Data Exploration
      • Basic Exploration
      • Detailed Exploration
        • Data Sampling
        • Trends Characteristics Analysis
        • Statistical Inference
  • Conclusions
    • Insights and Findings
    • Recommendations for Action
    • Further Areas for Investigation


Project Brief

The database for this project is iTunes Podcast Reviews, sourced from scraped iTunes podcast review RSS feeds. It contains information spanning 2019 to 2023 across the USA, offering valuable time-series data for a specific region.


Business Stakeholders and Objectives


Stakeholder Identification

  • Podcast creators / Podcast sponsors: Interested in understanding audience preferences and improving their content based on feedback.
  • Marketing teams: Aim to identify effective marketing strategies and optimize promotional efforts.
  • Data analysts: Responsible for extracting insights from the data to inform decision-making.

Stakeholder Objectives

  • Podcast creators: To identify popular podcast genres/topics and areas for content improvement through analysis of listener engagement and feedback.
  • Marketing teams: To analyze trends in podcast listenership and sentiment to inform marketing campaigns and target audience outreach.
  • Data analysts: To conduct thorough exploratory analysis to extract actionable insights from the podcast reviews dataset.


Notebook Initialization and Exploratory Data Analysis (EDA)


Setting up the Coding Environment


Importing Libraries and Credentials

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

from numba import cuda

cuda.detect()

from sqlalchemy import create_engine
import sqlite3
from math import sqrt

%load_ext cudf.pandas
import pandas as pd
import numpy as np
import cudf
from concurrent.futures import ThreadPoolExecutor
from textblob import TextBlob
from collections import Counter

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

nltk.download("vader_lexicon")
nltk.download("punkt")

import scipy.stats as stats
from scipy.stats import (
    f_oneway,
    chi2_contingency,
    norm,
    t,
    ttest_ind,
    kstest,
    levene,
    shapiro,
    kruskal,
)

from scikit_posthocs import posthoc_dunn

import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.proportion import proportion_confint

import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp

pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()
Found 1 CUDA devices
id 0    b'NVIDIA GeForce RTX 2060'                              [SUPPORTED]
                      Compute Capability: 7.5
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-d5fffe5d-eda6-e044-a96d-7e29d7648f51
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
	1/1 devices are supported
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/cannelle/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/cannelle/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Importing Functions

In [2]:
from utils.functions import (
    print_missing_and_duplicates,
    map_to_general_category,
    analyze_sentiment,
    sample_data_with_min_category_count,
    check_independence,
    check_sample_sizes,
)


Loading the Data


Downloading Database

In [3]:
# kaggle_json_filename = "kaggle.json"
# notebook_directory = os.getcwd()
# kaggle_json_path = os.path.join(notebook_directory, kaggle_json_filename)

# if os.path.exists(kaggle_json_path):
#     os.environ['KAGGLE_CONFIG_DIR'] = notebook_directory
#     import kaggle
# else:
#     print("Error: kaggle.json file not found in the project root directory.")

# kaggle.api.authenticate()
# kaggle.api.dataset_download_files(dataset="thoughtvector/podcastreviews", path="./datasets", unzip=True)

# download_path = "./datasets"

# old_file_path = os.path.join(download_path, "database.db")
# new_file_path = os.path.join(download_path, "database.sqlite")

# if os.path.exists(old_file_path):
#     os.rename(old_file_path, new_file_path)
In [4]:
cnx = sqlite3.connect("./datasets/database.sqlite")
df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)

print(df)
         name
0        runs
1    podcasts
2  categories
3     reviews
In [5]:
categories = pd.read_sql_query("SELECT * FROM Categories", cnx)
podcasts = pd.read_sql_query("SELECT * FROM Podcasts", cnx)
reviews = pd.read_sql_query("SELECT * FROM Reviews", cnx)
runs = pd.read_sql_query("SELECT * FROM Runs", cnx)

display(categories.head(2))
display(podcasts.head(2))
display(reviews.head(2))
display(runs.head(2))
podcast_id category
0 c61aa81c9b929a66f0c1db6cbe5d8548 arts
1 c61aa81c9b929a66f0c1db6cbe5d8548 arts-performing-arts
podcast_id itunes_id slug itunes_url title
0 a00018b54eb342567c94dacfb2a3e504 1313466221 scaling-global https://podcasts.apple.com/us/podcast/scaling-... Scaling Global
1 a00043d34e734b09246d17dc5d56f63c 158973461 cornerstone-baptist-church-of-orlando https://podcasts.apple.com/us/podcast/cornerst... Cornerstone Baptist Church of Orlando
podcast_id title content rating author_id created_at
0 c61aa81c9b929a66f0c1db6cbe5d8548 really interesting! Thanks for providing these insights. Really e... 5 F7E5A318989779D 2018-04-24T12:05:16-07:00
1 c61aa81c9b929a66f0c1db6cbe5d8548 Must listen for anyone interested in the arts!!! Super excited to see this podcast grow. So man... 5 F6BF5472689BD12 2018-05-09T18:14:32-07:00
run_at max_rowid reviews_added
0 2021-05-10 02:53:00 3266481 1215223
1 2021-06-06 21:34:36 3300773 13139

Data Brief

There are 4 tables:

  1. Categories - Categories data [podcast_id, category]
  2. Podcasts - Podcasts data [podcast_id, itunes_id, slug, itunes_url, title]
  3. Reviews - Reviews data [podcast_id, title, content, rating, author_id, created_at]
  4. Runs - Runs data [run_at, max_rowid, reviews_added]

Cleaning the Data

Checking for duplicates and missing values

Missing Values and Duplicates Across Columns
JUSTIFICATION FOR PROCESSING:
  • Missing values can lead to inaccurate or misleading statistics and machine learning model predictions. They arise for various reasons, such as data entry errors or failures in data collection, and different strategies can be employed to handle them depending on their nature and extent.
  • Duplicate rows can occur through data entry errors, merging of datasets, and similar causes, and can bias or invalidate analysis results. It is therefore important to identify and remove them.
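The helper `print_missing_and_duplicates` is imported from `utils.functions` and its implementation is not shown in this notebook. A minimal sketch consistent with the usage in the next cell might look like this (the real function may differ):

```python
import pandas as pd

def print_missing_and_duplicates(df: pd.DataFrame, table_name: str) -> None:
    """Report per-column missing-value counts and the number of fully
    duplicated rows; print nothing for a clean table."""
    missing = df.isna().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        print(f"Missing values in {table_name} table:")
        print(missing)
    duplicate_count = df.duplicated().sum()
    if duplicate_count > 0:
        print(f"Duplicates in {table_name} table:")
        print(duplicate_count)
```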
In [6]:
print_missing_and_duplicates(categories, "Categories")
print_missing_and_duplicates(podcasts, "Podcasts")
print_missing_and_duplicates(reviews, "Reviews")
print_missing_and_duplicates(runs, "Runs")
Duplicates in Reviews table:
655
Dropping Duplicate Rows:
In [7]:
reviews = reviews.drop_duplicates()

Chosen Strategy for Organizing Tables:

1. Merging Tables:

  • The three tables are merged based on the podcast_id value.
  • The rows are sorted based on this value.
  • Merging the tables has been chosen to streamline the workflow to facilitate the use of data for various comparisons.
In [8]:
merged_table = pd.merge(categories, podcasts, on="podcast_id", how="outer")
merged_table = pd.merge(merged_table, reviews, on="podcast_id", how="outer")

merged_table.sort_values(by=["podcast_id"], inplace=True)
merged_table = merged_table.reset_index(drop=True)

2. Preparing the data for display:

  • Mapping the category values to fit into one of the categories - Business & Finance, Religion & Spirituality, News & Politics, Sports & Recreation, Arts, Education, Society & Culture, TV & Film, Health & Fitness, Others, Music, True Crime, Comedy, History, Leisure, Kids & Family, Science, Fiction, Technology, Government
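The mapping itself is delegated to `map_to_general_category` from `utils.functions`, whose implementation is not shown here. A hypothetical sketch of such a mapper, keyed on the first segment of the raw category slug (the lookup table below is illustrative and deliberately incomplete):

```python
# Hypothetical mapper; the real map_to_general_category in utils.functions
# may use a different (and complete) lookup table.
GENERAL_CATEGORIES = {
    "arts": "Arts",
    "business": "Business & Finance",
    "comedy": "Comedy",
    "education": "Education",
    "religion": "Religion & Spirituality",
    "music": "Music",
    # ...remaining general categories omitted for brevity
}

def map_to_general_category(category) -> str:
    """Map a raw category slug such as 'arts-performing-arts' to a general bucket."""
    if not isinstance(category, str):
        return "Others"
    prefix = category.split("-")[0].lower()
    return GENERAL_CATEGORIES.get(prefix, "Others")
```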
In [9]:
processed_dataset = merged_table.copy()
processed_dataset["category"] = processed_dataset["category"].apply(
    map_to_general_category
)
processed_dataset["podcast_title"] = (
    processed_dataset["title_x"].fillna("")
    + " "
    + processed_dataset["title_y"].fillna("")
)

processed_dataset.head(2)
Out[9]:
podcast_id category itunes_id slug itunes_url title_x title_y content rating author_id created_at podcast_title
0 a00018b54eb342567c94dacfb2a3e504 Business & Finance 1.313466e+09 scaling-global https://podcasts.apple.com/us/podcast/scaling-... Scaling Global Very informative Great variety of speakers! 5 CC47C85896D423B 2017-11-29T12:16:43-07:00 Scaling Global Very informative
1 a00043d34e734b09246d17dc5d56f63c Religion & Spirituality 1.589735e+08 cornerstone-baptist-church-of-orlando https://podcasts.apple.com/us/podcast/cornerst... Cornerstone Baptist Church of Orlando Good Sernons I'm a regular listener. I only wish that the ... 5 103CC9DA2046218 2019-10-08T04:23:32-07:00 Cornerstone Baptist Church of Orlando Good Ser...

OUTCOMES:

  • No missing values were found in the datasets.
  • 655 duplicates were found in the Reviews table and were subsequently dropped.
  • The decision was made to reorganize the data into one table, as this could potentially facilitate analysis in further steps.



Data Exploration Overview

Scrutinizing the dataset to identify key patterns, relationships, and trends. This process aids in detecting significant variables and anomalies, leading to more accurate predictions and insights.


Preliminary Plan for Data Exploration

Basic Exploration

  1. Utilize the describe function to provide an overview of numerical and categorical features in each dataset.
  2. Check the distributions of podcasts over categories, ratings, and number of reviews over time.

Detailed Exploration

  1. Data Sampling:

    • Preparing a subset of data
  2. Trends Characteristics Analysis:

    • Understanding podcast listenership trends
    • Identifying popular podcast genres/topics
    • Analyzing sentiment of podcast reviews
  3. Statistical Inference:

    • Correlation between average ratings and voting counts
    • Variances in rating averages across podcast categories
    • Monthly variations in rating averages

Basic Exploration

Summary of Dataset Features
JUSTIFICATION FOR PROCESSING: The processing of dataset features is essential for gaining insights into the underlying data distribution, identifying patterns, and facilitating informed decision-making.
In [10]:
print("Processed Dataset")
display(processed_dataset.describe(include=["object"]).T)
Processed Dataset
count unique top freq
podcast_id 4552196 111544 bf5bf76d5b6ffbf9a31bba4480383b7f 33100
category 4552196 20 Society & Culture 661552
slug 4527973 108919 crime-junkie 33100
itunes_url 4527973 110024 https://podcasts.apple.com/us/podcast/crime-ju... 33100
title_x 4527973 109274 Crime Junkie 33100
title_y 4552196 1138688 Great podcast 30828
content 4552196 2049707 I love this podcast! 404
author_id 4552196 1475285 D3307ADEFFA285C 1660
created_at 4552196 2054352 2017-09-19T08:29:49-07:00 14
podcast_title 4552196 1868684 Crime Junkie Obsessed 466
In [11]:
unique_ratings = processed_dataset["rating"].unique()
count_ratings = len(unique_ratings)
top_rating = processed_dataset["rating"].mode().values[0]
top_rating_freq = processed_dataset["rating"].value_counts().max()
total_ratings = processed_dataset["rating"].count()

print("Unique Ratings:", unique_ratings)
print("Count of Unique Ratings:", count_ratings)
print("Most Common Rating (Top):", top_rating)
print("Frequency of Most Common Rating:", top_rating_freq)
print("Total Number of Ratings:", total_ratings)
Unique Ratings: [5 1 4 2 3]
Count of Unique Ratings: 5
Most Common Rating (Top): 5
Frequency of Most Common Rating: 3982850
Total Number of Ratings: 4552196
Data Distribution Check
JUSTIFICATION FOR PROCESSING: The processing of this data is crucial for gaining insights into the trends and patterns across the database.
In [12]:
category_counts = processed_dataset["category"].value_counts()

fig = px.bar(
    x=category_counts.index,
    y=category_counts.values,
    labels={"x": "Category", "y": "Count"},
)

fig.update_layout(
    title="Podcast Counts by Category",
    xaxis_title="Category",
    yaxis_title="Podcast Count",
    template="plotly_dark",
)

fig.show()
In [13]:
rating_counts = processed_dataset["rating"].value_counts()

fig = px.bar(
    x=rating_counts.index, y=rating_counts.values, labels={"x": "Rating", "y": "Count"}
)

fig.update_layout(
    title="Podcast Counts by Rating",
    xaxis_title="Rating",
    yaxis_title="Podcast Count",
    template="plotly_dark",
)

fig.show()
In [14]:
runs["run_date"] = pd.to_datetime(runs["run_at"]).dt.date

reviews_added_per_day = runs.groupby("run_date")["reviews_added"].sum().reset_index()

fig = px.line(
    reviews_added_per_day,
    x="run_date",
    y="reviews_added",
    title="Reviews Added Over Time",
    template="plotly_dark",
)

fig.update_xaxes(title="Date")
fig.update_yaxes(title="Number of Reviews Added")

fig.show()

OUTCOMES:

  1. Podcast Counts by Category: The Society & Culture category has the highest number of entries (661,552), followed by Business & Finance (435,586) and Comedy (413,024). Conversely, the categories with the fewest entries are Government (15,483), Others (25,906), and Technology (47,808).
  2. Podcast Counts by Rating: The majority of reviews carry a rating of 5 (3.98 million), indicating the highest level of satisfaction, while the fewest carry a rating of 2 (94.62k) on a scale of 1-5.
  3. Reviews Added Over Time: The highest number of reviews was recorded on May 10, 2021 (exceeding 1.2 million), followed by July 3, 2022 (559,523).


Detailed Exploration


Data Sampling

This exploratory data analysis (EDA) draws a random sample of 10% of the dataset's rows to strike a balance between representativeness and computational efficiency: this fraction provides adequate coverage of the dataset's characteristics while limiting the computational resources required. Setting the random state parameter to 42 ensures reproducibility of results across analyses.
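The sampling is performed by `sample_data_with_min_category_count` from `utils.functions`, which (as the name suggests) also enforces a minimum per-category count. A plausible sketch, where the `min_count` threshold is an assumption and the real helper may differ:

```python
import pandas as pd

def sample_data_with_min_category_count(
    df: pd.DataFrame,
    frac: float = 0.1,
    min_count: int = 100,  # assumed threshold; the actual helper may differ
    random_state: int = 42,
) -> pd.DataFrame:
    """Draw a reproducible random sample of `frac` of the rows, then drop
    categories that end up with fewer than `min_count` rows, so sparse
    groups do not distort group-wise statistics."""
    sample = df.sample(frac=frac, random_state=random_state)
    counts = sample["category"].value_counts()
    keep = counts[counts >= min_count].index
    return sample[sample["category"].isin(keep)]
```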

In [15]:
sampled_data = sample_data_with_min_category_count(processed_dataset)
display(sampled_data.head(2))
podcast_id category itunes_id slug itunes_url title_x title_y content rating author_id created_at podcast_title
4257723 f85fdf4372bbd041d91178b41cec9c62 News & Politics 6.667518e+08 james-obriens-mystery-hour https://podcasts.apple.com/us/podcast/james-ob... James O'Brien's Mystery Hour How can I possibly be only 2nd review for this... I live in Kansas City and am not able to liste... 5 9EEF534801232E5 2016-05-19 12:28:52-07:00 James O'Brien's Mystery Hour How can I possibl...
460510 aa41a90ae2ebfae71eb887fe9375b6d5 News & Politics 1.458648e+09 hear-the-bern https://podcasts.apple.com/us/podcast/hear-the... Hear the Bern I’m a big Bernie supporter but was still surpr... So you would assume a podcast about a presiden... 5 88BD551F001F391 2019-09-08 18:46:28-07:00 Hear the Bern I’m a big Bernie supporter but w...


Trends Characteristics Analysis

Understanding podcast listenership trends

In [16]:
most_rated_query_df = (
    sampled_data.groupby(["podcast_title"])
    .agg({"rating": ["count", "mean"]})
    .reset_index()
)
most_rated_query_df.columns = ["podcast_title", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
).head(10)

most_rated_query_df_count = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
    by="avg_rating", ascending=False
)

fig1 = px.bar(
    most_rated_query_df_count,
    x="podcast_title",
    y="rating_count",
    title="Top 10 Podcasts by Review Frequency",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Title")
fig1.update_yaxes(title="Number of Reviews Received")

fig2 = px.bar(
    best_rated_query_df_avg,
    x="podcast_title",
    y="avg_rating",
    title="Top 10 Podcasts by Average Rating",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Title")
fig2.update_yaxes(title="Average Rating")

fig1.show()
fig2.show()

Identifying popular podcast genres/topics

In [17]:
most_rated_query_df = (
    sampled_data.groupby(["category"]).agg({"rating": ["count", "mean"]}).reset_index()
)
most_rated_query_df.columns = ["category", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
).head(10)

most_rated_query_df_count = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
    by="avg_rating", ascending=False
)

fig1 = px.bar(
    most_rated_query_df_count,
    x="category",
    y="rating_count",
    title="Top 10 Categories by Review Frequency",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Category")
fig1.update_yaxes(title="Number of Reviews Received")

fig2 = px.bar(
    best_rated_query_df_avg,
    x="category",
    y="avg_rating",
    title="Top 10 Categories by Average Rating",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Category")
fig2.update_yaxes(title="Average Rating")

fig1.show()
fig2.show()

Analyzing sentiment of podcast reviews

In [18]:
sampled_data["sentiment"] = sampled_data["content"].apply(analyze_sentiment)

fig = px.histogram(
    sampled_data,
    x="sentiment",
    nbins=30,
    title="Sentiment Distribution of Podcast Reviews",
)
fig.update_layout(
    xaxis_title="Sentiment Polarity",
    yaxis_title="Frequency",
    bargap=0.05,
    template="plotly_dark",
)
fig.show()

OUTCOMES:

  • The most rated podcasts were "Crime Junkie Obsessed" with 48 ratings, "Crime Junkie Amazing" with 33 ratings, and "Crime Junkie Love!" with 29 ratings. These podcasts had respective average ratings of 5.00, 5.00, and 4.93.

  • The podcasts with the highest average ratings were "Crime Junkie Obsessed", "Crime Junkie Amazing", "Crime Junkie Amazing!", "Awesome", and "Crime Junkie Addicted", all achieving a perfect average rating of 5.0.

  • The most rated podcast category was "Society & Culture," which amassed over 32 thousand ratings out of more than 4.5 million total ratings, accounting for approximately 0.7% of the total. Other highly rated categories include "Sports & Recreation", with over 17 thousand ratings (0.3% of the total), and "TV & Film", with more than 15 thousand ratings (0.3%).

  • Podcasts in the Business & Finance category achieved the highest average ratings, with an impressive average of 4.87. Following closely are podcasts in the Religion & Spirituality category and Education category, both with an average rating of 4.83.

  • The sentiment analysis indicates a predominantly positive sentiment in podcast reviews, with the majority of ratings falling within the positive range.

INSIGHTS:

  • The most rated podcasts accumulated between 29 and 48 ratings each, with highly favorable average ratings ranging from 4.93 to 5.00.

  • Podcasts consistently rated with an average of 5.0 not only attract a substantial audience but also consistently deliver content that resonates exceptionally well, resulting in consistently high average ratings across episodes.

  • While "Society & Culture" may have the most rated podcasts, along with "Sports & Recreation", and "TV & Film" attracting a significant number of ratings, it's noteworthy that podcasts in the Business & Finance category receive the highest average ratings. This suggests that listeners highly appreciate the quality and value of content offered in this genre. Similarly, Religion & Spirituality and Education categories also boast exceptionally high average ratings, indicating strong listener satisfaction in these areas.

  • The distribution of sentiment analysis highlights a predominantly positive sentiment in podcast reviews, indicating high overall satisfaction among listeners.

Statistical Inference

The goal of statistical inference in this EDA is to study the impact of podcast parameters on the distribution of ratings. We examine differences in ratings across podcast categories, the number of people voting for them, and whether specific time periods are associated with higher or lower ratings.

Category Rating Differences

This analysis focuses on examining whether there are rating differences between categories in the database.

Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.

Significance Levels:
The chosen significance level for hypothesis testing α = 0.05.

Confidence Intervals:
The 95% confidence intervals for the mean rating within each podcast category are:

  • Arts: [4.686, 4.733]
  • Business & Finance: [4.860, 4.881]
  • Comedy: [4.664, 4.696]
  • Education: [4.815, 4.843]
  • Fiction: [4.569, 4.649]
  • Government: [4.496, 4.677]
  • Health & Fitness: [4.781, 4.807]
  • History: [4.667, 4.745]
  • Kids & Family: [4.698, 4.740]
  • Leisure: [4.750, 4.784]
  • Music: [4.758, 4.802]
  • News & Politics: [4.234, 4.281]
  • Others: [4.631, 4.708]
  • Religion & Spirituality: [4.822, 4.848]
  • Science: [4.555, 4.627]
  • Society & Culture: [4.641, 4.663]
  • Sports & Recreation: [4.625, 4.656]
  • TV & Film: [4.531, 4.566]
  • Technology: [4.556, 4.631]
  • True Crime: [4.147, 4.195]

Each interval provides a range of plausible values for the true population mean rating of podcasts in the corresponding category, at a 95% confidence level. For example, for the 'Arts' category, we can be 95% confident that the true mean rating falls within [4.686, 4.733]. These intervals help to assess the variability in ratings across podcast categories.

In [19]:
mean_ratings = sampled_data.groupby("category")["rating"].mean()
sample_size = sampled_data.groupby("category")["rating"].count()
sample_std = sampled_data.groupby("category")["rating"].std()

confidence_level = 0.95
t_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

zero_std_mask = sample_std == 0
sample_std[zero_std_mask] = np.nan
sample_std_filled = sample_std.fillna(0)

small_sample_mask = sample_size < 2
t_value[small_sample_mask] = 0
sample_std[small_sample_mask] = np.nan
sample_std_filled = sample_std.fillna(0)

margin_of_error = t_value * sample_std_filled / np.sqrt(sample_size)
confidence_interval_means = (
    mean_ratings - margin_of_error,
    mean_ratings + margin_of_error,
)

print("Confidence Interval for Mean Rating within Categories:")
display(confidence_interval_means)

lower_bounds = confidence_interval_means[0]
upper_bounds = confidence_interval_means[1]

overall_lower_bound = np.min(lower_bounds)
overall_upper_bound = np.max(upper_bounds)

print("Overall Lower Bound:", overall_lower_bound)
print("Overall Upper Bound:", overall_upper_bound)
Confidence Interval for Mean Rating within Categories:
(category
 Arts                       4.686810
 Business & Finance         4.860490
 Comedy                     4.663586
 Education                  4.814655
 Fiction                    4.568826
 Government                 4.496118
 Health & Fitness           4.780999
 History                    4.666550
 Kids & Family              4.697983
 Leisure                    4.749629
 Music                      4.758497
 News & Politics            4.233139
 Others                     4.631171
 Religion & Spirituality    4.821502
 Science                    4.554748
 Society & Culture          4.641353
 Sports & Recreation        4.624880
 TV & Film                  4.530586
 Technology                 4.556431
 True Crime                 4.147117
 Name: rating, dtype: float64,
 category
 Arts                       4.732914
 Business & Finance         4.881052
 Comedy                     4.696280
 Education                  4.842816
 Fiction                    4.648811
 Government                 4.676677
 Health & Fitness           4.807036
 History                    4.745215
 Kids & Family              4.740142
 Leisure                    4.784429
 Music                      4.802010
 News & Politics            4.280996
 Others                     4.707833
 Religion & Spirituality    4.847650
 Science                    4.627287
 Society & Culture          4.663252
 Sports & Recreation        4.655529
 TV & Film                  4.566012
 Technology                 4.630872
 True Crime                 4.195473
 Name: rating, dtype: float64)
Overall Lower Bound: 4.147117490111806
Overall Upper Bound: 4.88105228429321

Data Distribution Check:

In [20]:
grouped_ratings = sampled_data.groupby("category")["rating"]

for category, ratings in grouped_ratings:
    print(f"Category: {category}")

    stat, p_value = shapiro(ratings)
    print("Shapiro-Wilk Test p-value:", p_value)

    stat, p_value = kstest(ratings, "norm")
    print("Kolmogorov-Smirnov Test p-value:", p_value)

    print("\n")
Category: Arts
Shapiro-Wilk Test p-value: 4.435199230339896e-91
Kolmogorov-Smirnov Test p-value: 0.0


Category: Business & Finance
Shapiro-Wilk Test p-value: 7.015005998773669e-119
Kolmogorov-Smirnov Test p-value: 0.0


Category: Comedy
Shapiro-Wilk Test p-value: 3.0021366917472324e-114
Kolmogorov-Smirnov Test p-value: 0.0


Category: Education
Shapiro-Wilk Test p-value: 8.955847023207261e-109
Kolmogorov-Smirnov Test p-value: 0.0


Category: Fiction
Shapiro-Wilk Test p-value: 2.0842131917784576e-66
Kolmogorov-Smirnov Test p-value: 0.0


Category: Government
Shapiro-Wilk Test p-value: 3.660556526348004e-38
Kolmogorov-Smirnov Test p-value: 0.0


Category: Health & Fitness
Shapiro-Wilk Test p-value: 2.012532254502817e-116
/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 6104.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 14099.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 14372.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 10294.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 14108.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 6998.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 8559.

Kolmogorov-Smirnov Test p-value: 0.0


Category: History
Shapiro-Wilk Test p-value: 1.4460126372126276e-63
Kolmogorov-Smirnov Test p-value: 0.0


Category: Kids & Family
Shapiro-Wilk Test p-value: 9.028239730270882e-95
Kolmogorov-Smirnov Test p-value: 0.0


Category: Leisure
Shapiro-Wilk Test p-value: 2.0127116571910826e-101
Kolmogorov-Smirnov Test p-value: 0.0


Category: Music
Shapiro-Wilk Test p-value: 1.0564596796484096e-86
Kolmogorov-Smirnov Test p-value: 0.0


Category: News & Politics
Shapiro-Wilk Test p-value: 3.693276080744102e-103
Kolmogorov-Smirnov Test p-value: 0.0


Category: Others
Shapiro-Wilk Test p-value: 3.139281959947661e-68
Kolmogorov-Smirnov Test p-value: 0.0


Category: Religion & Spirituality
Shapiro-Wilk Test p-value: 6.3531948911195225e-111
Kolmogorov-Smirnov Test p-value: 0.0


Category: Science
Shapiro-Wilk Test p-value: 1.0528319397022419e-73
Kolmogorov-Smirnov Test p-value: 0.0


Category: Society & Culture
Shapiro-Wilk Test p-value: 2.0872461061468374e-136
Kolmogorov-Smirnov Test p-value: 0.0


Category: Sports & Recreation
Shapiro-Wilk Test p-value: 1.614874020997192e-118
Kolmogorov-Smirnov Test p-value: 0.0


Category: TV & Film
Shapiro-Wilk Test p-value: 8.852778681315428e-113
Kolmogorov-Smirnov Test p-value: 0.0


Category: Technology
Shapiro-Wilk Test p-value: 1.6473226779339912e-72
Kolmogorov-Smirnov Test p-value: 0.0


Category: True Crime
Shapiro-Wilk Test p-value: 1.7749459108517478e-100
Kolmogorov-Smirnov Test p-value: 0.0


/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 13619.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 11135.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 32836.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 17610.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 15787.

/tmp/ipykernel_7695/2213632488.py:6: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 13719.

Statistical Hypotheses:
Null Hypothesis (H0): There are no significant differences in rating averages among categories.
Alternative Hypothesis (H1): There are significant differences in rating averages among categories.

Hypothesis Testing:
To test for differences in average ratings between podcast categories, the Kruskal-Wallis test was conducted. This non-parametric test was chosen because the low p-values from the Shapiro-Wilk and Kolmogorov-Smirnov tests indicated violations of the normality assumption. The resulting test statistic of 7045.29 and a p-value close to 0.0 provide strong evidence against the null hypothesis, indicating significant differences in average ratings between podcast categories. We therefore conclude that there are meaningful disparities in ratings across podcast categories.

In [21]:
category_groups = [group["rating"] for _, group in sampled_data.groupby("category")]
h_statistic, p_value = kruskal(*category_groups)

print(f"P-value is: {p_value}, and test statistic is: {h_statistic}")

alpha = 0.05
if p_value < alpha:
    print(
        "Kruskal-Wallis Test: There are significant differences in rating averages among categories."
    )
else:
    print(
        "Kruskal-Wallis Test: No significant differences in rating averages among categories were found."
    )
P-value is: 0.0, and test statistic is: 7045.288475815118
Kruskal-Wallis Test: There are significant differences in rating averages among categories.
In [22]:
dunn_results = posthoc_dunn(
    sampled_data, val_col="rating", group_col="category", p_adjust="bonferroni"
)
print(dunn_results)
                                  Arts  Business & Finance         Comedy  \
Arts                      1.000000e+00        6.547332e-29   1.000000e+00   
Business & Finance        6.547332e-29        1.000000e+00   6.339745e-47   
Comedy                    1.000000e+00        6.339745e-47   1.000000e+00   
Education                 1.739472e-15        4.632711e-01   1.775443e-23   
Fiction                   7.624549e-06        2.889253e-43   1.401408e-07   
Government                2.138811e-01        2.429155e-11   1.236089e-01   
Health & Fitness          1.678479e-08        7.772101e-09   9.263626e-14   
History                   1.000000e+00        6.314510e-15   1.000000e+00   
Kids & Family             1.000000e+00        1.313875e-30   1.000000e+00   
Leisure                   7.412572e-02        9.657623e-16   7.257435e-03   
Music                     1.000000e+00        3.651925e-12   7.458100e-01   
News & Politics          1.802237e-153        0.000000e+00  9.913747e-259   
Others                    1.000000e+00        3.298722e-22   1.000000e+00   
Religion & Spirituality   6.765378e-16        3.452566e-01   1.699017e-24   
Science                   4.432943e-06        4.643807e-52   2.456006e-08   
Society & Culture         2.912177e-01       1.422405e-105   4.170844e-04   
Sports & Recreation       1.031376e-01        2.785183e-89   2.301198e-04   
TV & Film                 2.467110e-21       5.671723e-175   1.708955e-38   
Technology                1.726284e-06        1.046303e-51   9.140113e-09   
True Crime               1.675757e-248        0.000000e+00   0.000000e+00   

                             Education       Fiction    Government  \
Arts                      1.739472e-15  7.624549e-06  2.138811e-01   
Business & Finance        4.632711e-01  2.889253e-43  2.429155e-11   
Comedy                    1.775443e-23  1.401408e-07  1.236089e-01   
Education                 1.000000e+00  4.416969e-31  1.926443e-08   
Fiction                   4.416969e-31  1.000000e+00  1.000000e+00   
Government                1.926443e-08  1.000000e+00  1.000000e+00   
Health & Fitness          4.605610e-01  8.669419e-24  3.855833e-06   
History                   5.527620e-09  8.417708e-02  1.000000e+00   
Kids & Family             4.529617e-16  1.849270e-06  1.544410e-01   
Leisure                   1.144903e-05  1.477820e-14  6.627118e-04   
Music                     8.278227e-05  3.289272e-11  2.569645e-03   
News & Politics           0.000000e+00  6.302417e-35  3.052015e-07   
Others                    4.993615e-14  4.735407e-01  1.000000e+00   
Religion & Spirituality   1.000000e+00  1.711756e-31  1.842690e-08   
Science                   1.792636e-36  1.000000e+00  1.000000e+00   
Society & Culture         1.203963e-56  6.091772e-03  1.000000e+00   
Sports & Recreation       1.106336e-50  4.060352e-02  1.000000e+00   
TV & Film                2.074659e-113  1.000000e+00  1.000000e+00   
Technology                1.528424e-36  1.000000e+00  1.000000e+00   
True Crime                0.000000e+00  8.344859e-69  1.960076e-15   

                         Health & Fitness       History  Kids & Family  \
Arts                         1.678479e-08  1.000000e+00   1.000000e+00   
Business & Finance           7.772101e-09  6.314510e-15   1.313875e-30   
Comedy                       9.263626e-14  1.000000e+00   1.000000e+00   
Education                    4.605610e-01  5.527620e-09   4.529617e-16   
Fiction                      8.669419e-24  8.417708e-02   1.849270e-06   
Government                   3.855833e-06  1.000000e+00   1.544410e-01   
Health & Fitness             1.000000e+00  4.559223e-05   9.560801e-09   
History                      4.559223e-05  1.000000e+00   1.000000e+00   
Kids & Family                9.560801e-09  1.000000e+00   1.000000e+00   
Leisure                      6.748705e-01  1.415928e-01   9.456220e-02   
Music                        6.673752e-01  8.575344e-01   1.000000e+00   
News & Politics              0.000000e+00  4.131891e-55  4.538777e-171   
Others                       7.292406e-09  1.000000e+00   1.000000e+00   
Religion & Spirituality      3.908836e-01  4.483518e-09   1.546357e-16   
Science                      5.776229e-28  1.513078e-01   8.093922e-07   
Society & Culture            7.646485e-44  1.000000e+00   5.932844e-02   
Sports & Recreation          2.558988e-38  1.000000e+00   2.076450e-02   
TV & Film                   2.970267e-101  2.436026e-05   1.233376e-24   
Technology                   3.717438e-28  8.244841e-02   3.088103e-07   
True Crime                   0.000000e+00  8.203795e-92  4.355996e-276   

                               Leisure          Music  News & Politics  \
Arts                      7.412572e-02   1.000000e+00    1.802237e-153   
Business & Finance        9.657623e-16   3.651925e-12     0.000000e+00   
Comedy                    7.257435e-03   7.458100e-01    9.913747e-259   
Education                 1.144903e-05   8.278227e-05     0.000000e+00   
Fiction                   1.477820e-14   3.289272e-11     6.302417e-35   
Government                6.627118e-04   2.569645e-03     3.052015e-07   
Health & Fitness          6.748705e-01   6.673752e-01     0.000000e+00   
History                   1.415928e-01   8.575344e-01     4.131891e-55   
Kids & Family             9.456220e-02   1.000000e+00    4.538777e-171   
Leisure                   1.000000e+00   1.000000e+00    4.636785e-251   
Music                     1.000000e+00   1.000000e+00    1.079902e-165   
News & Politics          4.636785e-251  1.079902e-165     1.000000e+00   
Others                    9.540704e-04   1.988885e-02     3.944493e-60   
Religion & Spirituality   7.366326e-06   6.419010e-05     0.000000e+00   
Science                   2.319729e-16   4.524920e-12     8.049610e-49   
Society & Culture         2.669238e-15   1.039558e-07    3.832457e-279   
Sports & Recreation       7.920740e-15   4.659819e-08    5.848267e-214   
TV & Film                 4.408084e-53   1.073869e-32    4.873295e-106   
Technology                9.371786e-17   1.694069e-12     7.699651e-45   
True Crime                0.000000e+00  1.094172e-255     9.328188e-18   

                                Others  Religion & Spirituality       Science  \
Arts                      1.000000e+00             6.765378e-16  4.432943e-06   
Business & Finance        3.298722e-22             3.452566e-01  4.643807e-52   
Comedy                    1.000000e+00             1.699017e-24  2.456006e-08   
Education                 4.993615e-14             1.000000e+00  1.792636e-36   
Fiction                   4.735407e-01             1.711756e-31  1.000000e+00   
Government                1.000000e+00             1.842690e-08  1.000000e+00   
Health & Fitness          7.292406e-09             3.908836e-01  5.776229e-28   
History                   1.000000e+00             4.483518e-09  1.513078e-01   
Kids & Family             1.000000e+00             1.546357e-16  8.093922e-07   
Leisure                   9.540704e-04             7.366326e-06  2.319729e-16   
Music                     1.988885e-02             6.419010e-05  4.524920e-12   
News & Politics           3.944493e-60             0.000000e+00  8.049610e-49   
Others                    1.000000e+00             3.306089e-14  8.731518e-01   
Religion & Spirituality   3.306089e-14             1.000000e+00  4.201078e-37   
Science                   8.731518e-01             4.201078e-37  1.000000e+00   
Society & Culture         1.000000e+00             5.741894e-60  5.402766e-03   
Sports & Recreation       1.000000e+00             4.183075e-53  5.099429e-02   
TV & Film                 1.789166e-04            1.175773e-118  1.000000e+00   
Technology                4.908211e-01             3.807213e-37  1.000000e+00   
True Crime               1.214218e-102             0.000000e+00  6.849347e-94   

                         Society & Culture  Sports & Recreation  \
Arts                          2.912177e-01         1.031376e-01   
Business & Finance           1.422405e-105         2.785183e-89   
Comedy                        4.170844e-04         2.301198e-04   
Education                     1.203963e-56         1.106336e-50   
Fiction                       6.091772e-03         4.060352e-02   
Government                    1.000000e+00         1.000000e+00   
Health & Fitness              7.646485e-44         2.558988e-38   
History                       1.000000e+00         1.000000e+00   
Kids & Family                 5.932844e-02         2.076450e-02   
Leisure                       2.669238e-15         7.920740e-15   
Music                         1.039558e-07         4.659819e-08   
News & Politics              3.832457e-279        5.848267e-214   
Others                        1.000000e+00         1.000000e+00   
Religion & Spirituality       5.741894e-60         4.183075e-53   
Science                       5.402766e-03         5.099429e-02   
Society & Culture             1.000000e+00         1.000000e+00   
Sports & Recreation           1.000000e+00         1.000000e+00   
TV & Film                     5.367599e-26         1.947548e-17   
Technology                    2.159556e-03         2.184989e-02   
True Crime                    0.000000e+00         0.000000e+00   

                             TV & Film    Technology     True Crime  
Arts                      2.467110e-21  1.726284e-06  1.675757e-248  
Business & Finance       5.671723e-175  1.046303e-51   0.000000e+00  
Comedy                    1.708955e-38  9.140113e-09   0.000000e+00  
Education                2.074659e-113  1.528424e-36   0.000000e+00  
Fiction                   1.000000e+00  1.000000e+00   8.344859e-69  
Government                1.000000e+00  1.000000e+00   1.960076e-15  
Health & Fitness         2.970267e-101  3.717438e-28   0.000000e+00  
History                   2.436026e-05  8.244841e-02   8.203795e-92  
Kids & Family             1.233376e-24  3.088103e-07  4.355996e-276  
Leisure                   4.408084e-53  9.371786e-17   0.000000e+00  
Music                     1.073869e-32  1.694069e-12  1.094172e-255  
News & Politics          4.873295e-106  7.699651e-45   9.328188e-18  
Others                    1.789166e-04  4.908211e-01  1.214218e-102  
Religion & Spirituality  1.175773e-118  3.807213e-37   0.000000e+00  
Science                   1.000000e+00  1.000000e+00   6.849347e-94  
Society & Culture         5.367599e-26  2.159556e-03   0.000000e+00  
Sports & Recreation       1.947548e-17  2.184989e-02   0.000000e+00  
TV & Film                 1.000000e+00  1.000000e+00  1.353873e-217  
Technology                1.000000e+00  1.000000e+00   3.030409e-87  
True Crime               1.353873e-217  3.030409e-87   1.000000e+00  
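
The pairwise matrix above is hard to scan; it can be condensed into a list of pairs whose adjusted p-value falls below α. A minimal sketch, using a small illustrative matrix in place of `dunn_results` (the category names and p-values here are made up):

```python
import pandas as pd

# Toy symmetric p-value matrix standing in for `dunn_results`
# (values are illustrative, not from the dataset).
pvals = pd.DataFrame(
    [[1.0, 0.001, 0.20],
     [0.001, 1.0, 0.04],
     [0.20, 0.04, 1.0]],
    index=["Arts", "News & Politics", "Comedy"],
    columns=["Arts", "News & Politics", "Comedy"],
)

def significant_pairs(dunn_matrix, alpha=0.05):
    """Collect category pairs whose adjusted p-value is below alpha."""
    pairs = []
    cols = dunn_matrix.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only: each pair once
            p = dunn_matrix.loc[a, b]
            if p < alpha:
                pairs.append((a, b, p))
    return pairs

print(significant_pairs(pvals))
# → [('Arts', 'News & Politics', 0.001), ('News & Politics', 'Comedy', 0.04)]
```

The same helper applied to the real `dunn_results` would list every significantly different category pair at α = 0.05.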

Categories Rating Count Differences

This analysis focuses on examining whether there are rating count differences between podcast categories in the database.

Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.

Significance Levels:
The chosen significance level for hypothesis testing is α = 0.05.

Confidence Intervals:
The 95% Wilson confidence intervals for the proportion of reviews within each podcast category that carry a rating are as follows:

  • Arts: (0.999057, 1.000314)
  • Business & Finance: (0.999591, 1.000136)
  • Comedy: (0.999599, 1.000134)
  • Education: (0.999440, 1.000187)
  • Fiction: (0.997673, 1.000776)
  • Government: (0.989482, 1.003506)
  • Health & Fitness: (0.999592, 1.000136)
  • History: (0.997084, 1.000972)
  • Kids & Family: (0.999177, 1.000274)
  • Leisure: (0.999327, 1.000224)
  • Music: (0.998823, 1.000392)
  • News & Politics: (0.999577, 1.000141)
  • Others: (0.997670, 1.000777)
  • Religion & Spirituality: (0.999483, 1.000172)
  • Science: (0.998288, 1.000571)
  • Society & Culture: (0.999825, 1.000058)
  • Sports & Recreation: (0.999673, 1.000109)
  • TV & Film: (0.999635, 1.000122)
  • Technology: (0.998191, 1.000603)
  • True Crime: (0.999580, 1.000140)

Each interval gives a range of plausible values for the true population proportion at the 95% confidence level. Because essentially every review in the sample carries a rating, the estimated proportions sit at 1 and the intervals are extremely narrow; bounds slightly above 1 are an artifact of the extra margin-of-error widening applied in the code below. These intervals therefore say little about differences in rating counts between categories; that question is addressed by the chi-square test below.

In [23]:
category_counts = sampled_data.groupby("category")["rating"].count()

confidence_intervals = []

# `confidence_level` is defined earlier in the notebook (0.95, per the 95%
# intervals reported above).
for count, size in zip(category_counts, sampled_data.groupby("category").size()):
    if size < 2:
        # Too few observations for an interval; fall back to a point value.
        ci_low, ci_high = count, count
    else:
        # Wilson interval for the proportion of reviews that carry a rating;
        # `count` equals `size` here, so the interval is centered on 1.
        ci_low, ci_high = proportion_confint(
            count, size, alpha=1 - confidence_level, method="wilson"
        )

    # Widen each interval by half of its own width on both sides.
    if ci_low == ci_high:
        margin_of_error = 0
    else:
        margin_of_error = (ci_high - ci_low) / 2

    ci_low_adjusted = max(0, ci_low - margin_of_error)
    ci_high_adjusted = ci_high + margin_of_error

    confidence_intervals.append((ci_low_adjusted, ci_high_adjusted))

print("Confidence Intervals for Count of Ratings within Categories:")
for category, ci in zip(category_counts.index, confidence_intervals):
    print(f"{category}: {ci}")
Confidence Intervals for Count of Ratings within Categories:
Arts: (0.9990565917157656, 1.0003144694280781)
Business & Finance: (0.999591416506534, 1.000136194497822)
Comedy: (0.9995991755858225, 1.0001336081380594)
Education: (0.9994404469855083, 1.0001865176714972)
Fiction: (0.9976726344045528, 1.0007757885318156)
Government: (0.9894820150277689, 1.0035059949907437)
Health & Fitness: (0.999591677085669, 1.0001361076381103)
History: (0.9970836788522089, 1.0009721070492636)
Kids & Family: (0.9991770467433561, 1.0002743177522146)
Leisure: (0.9993270705455949, 1.000224309818135)
Music: (0.998823044357235, 1.000392318547588)
News & Politics: (0.9995770200916998, 1.0001409933027667)
Others: (0.9976698108928547, 1.0007767297023817)
Religion & Spirituality: (0.99948269411569, 1.0001724352947698)
Science: (0.9982880393204676, 1.0005706535598442)
Society & Culture: (0.9998245366611082, 1.0000584877796306)
Sports & Recreation: (0.9996728602193614, 1.000109046593546)
TV & Film: (0.9996350930223659, 1.0001216356592117)
Technology: (0.9981913135648709, 1.0006028954783766)
True Crime: (0.9995801023972818, 1.0001399658675725)

Data Sample Sizes & Independence of Observations Check:

In [24]:
contingency_table = pd.crosstab(sampled_data["category"], sampled_data["rating"])

independence_check_passed = check_independence(contingency_table)
appropriate_sample_sizes_passed = check_sample_sizes(contingency_table)

if independence_check_passed and appropriate_sample_sizes_passed:
    print("Assumptions for the chi-square test are met.")
else:
    print(
        "Assumptions for the chi-square test are not fully met. Further examination may be required."
    )
Assumptions for the chi-square test are met.
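
The helper functions `check_independence` and `check_sample_sizes` are defined earlier in the notebook. As an illustration only, the standard expected-frequency rule of thumb (all expected cell counts at least 5) could be checked like this, using made-up counts in place of the real contingency table:

```python
import numpy as np
from scipy import stats

# Illustrative contingency table (counts are made up); in the notebook the
# table is pd.crosstab(sampled_data["category"], sampled_data["rating"]).
observed = np.array([
    [330, 130, 150, 180, 5300],
    [770, 300, 340, 415, 12270],
])

# Cochran's rule of thumb: the chi-square approximation is reliable when
# all expected cell frequencies are at least 5.
_, _, _, expected = stats.chi2_contingency(observed)
sample_sizes_ok = (expected >= 5).all()
print("All expected frequencies >= 5:", sample_sizes_ok)
```

With the notebook's large per-category samples, this condition is comfortably met, which is consistent with the "Assumptions ... are met" message above.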

Statistical Hypotheses:
Null Hypothesis (H0): The distribution of ratings does not differ across podcast categories.
Alternative Hypothesis (H1): The distribution of ratings differs across podcast categories.

Hypothesis Testing:
To investigate differences in rating counts across podcast categories, a chi-square test of independence was conducted on the category-by-rating contingency table. This choice was justified by the categorical nature of the data, which suits a test of counts and proportions across categorical variables. Before running the test, the assumptions of the chi-square test, including independence of observations and adequate sample sizes, were verified. The resulting test statistic of 7812.36, coupled with a p-value close to 0.0, provides compelling evidence against the null hypothesis, indicating that the distribution of ratings differs significantly across podcast categories.

In [25]:
contingency_table = pd.crosstab(sampled_data["category"], sampled_data["rating"])

chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)

alpha = 0.05
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_val)
print("Degrees of Freedom:", dof)

if p_val < alpha:
    print(
        "Reject the null hypothesis. There is a significant difference between ratings within podcast categories."
    )
else:
    print(
        "Fail to reject the null hypothesis. There is no significant difference between ratings within podcast categories."
    )

print("Expected Frequencies Table:")
print(expected[:5])
Chi-square Statistic: 7812.363250371442
P-value: 0.0
Degrees of Freedom: 76
Reject the null hypothesis. There is a significant difference between ratings within podcast categories.
Expected Frequencies Table:
[[  332.62953673   130.71659946   148.21530992   179.51582018
   5312.92273372]
 [  768.30665765   301.92879026   342.34725664   414.64507679
  12271.77221866]
 [  783.18343739   307.77506019   348.97615238   422.67388068
  12509.39146937]
 [  560.95813418   220.44506468   249.95550464   302.74178456
   8959.89951194]
 [  134.70842313    52.93765299    60.02428672    72.70037803
   2151.62925913]]

Temporal Analysis

This analysis focuses on examining whether mean ratings differ between months in the database.

Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.

Significance Levels:
The chosen significance level for hypothesis testing is α = 0.05.

Confidence Intervals:
The 95% confidence intervals for the mean rating in each month are as follows:

  • Month 1: (4.615, 4.634)
  • Month 2: (4.639, 4.658)
  • Month 3: (4.632, 4.651)
  • Month 4: (4.657, 4.676)
  • Month 5: (4.653, 4.672)
  • Month 6: (4.619, 4.638)
  • Month 7: (4.629, 4.648)
  • Month 8: (4.622, 4.641)
  • Month 9: (4.637, 4.655)
  • Month 10: (4.628, 4.646)
  • Month 11: (4.627, 4.646)
  • Month 12: (4.606, 4.624)

Each interval gives a range of plausible values for the true mean rating of all podcasts in the dataset for the corresponding month; we can be 95% confident that the true monthly mean falls within its respective interval.

In [26]:
sampled_data["created_at"] = pd.to_datetime(sampled_data["created_at"])
sampled_data["month"] = sampled_data["created_at"].dt.month

monthly_rating = sampled_data.groupby("month")["rating"].mean()
mean_rating = monthly_rating.mean()
std_rating = monthly_rating.std()
# Standard error of the twelve monthly means; note that this single pooled
# value is reused for every month's interval below.
std_error = std_rating / len(monthly_rating) ** 0.5

# T-score for 95% confidence level (df = 11)
t_score = t.ppf(0.975, df=len(monthly_rating) - 1)

confidence_intervals = []
for month, rating in monthly_rating.items():
    lower_bound = rating - t_score * std_error
    upper_bound = rating + t_score * std_error
    confidence_intervals.append((month, lower_bound, upper_bound))

print("Confidence intervals for mean rating by each month:")
for month, lower, upper in confidence_intervals:
    print(f"Month {month}: ({lower}, {upper})")
Confidence intervals for mean rating by each month:
Month 1: (4.614953467004247, 4.633840158851438)
Month 2: (4.639409311081235, 4.658296002928426)
Month 3: (4.631804807695311, 4.650691499542503)
Month 4: (4.657015545804884, 4.675902237652076)
Month 5: (4.65336809670944, 4.672254788556631)
Month 6: (4.61902761321306, 4.637914305060251)
Month 7: (4.629482423539289, 4.648369115386481)
Month 8: (4.622132565474651, 4.641019257321842)
Month 9: (4.636555840548599, 4.65544253239579)
Month 10: (4.627515799051409, 4.6464024908986)
Month 11: (4.627207014529546, 4.646093706376737)
Month 12: (4.6055529016031835, 4.624439593450375)
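
Note that the cell above reuses a single pooled standard error, computed across the twelve monthly means, for every month's interval. An alternative, sketched below with synthetic ratings standing in for one month's data, is to build each interval from that month's own observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for one month's ratings; the real data would be
# sampled_data.loc[sampled_data["month"] == m, "rating"].
ratings = rng.choice([1, 2, 3, 4, 5], size=500, p=[0.03, 0.02, 0.03, 0.07, 0.85])

mean = ratings.mean()
sem = stats.sem(ratings)  # this month's own standard error
lo, hi = stats.t.interval(0.95, df=len(ratings) - 1, loc=mean, scale=sem)
print(f"95% CI for this month's mean rating: ({lo:.3f}, {hi:.3f})")
```

Per-month standard errors let interval widths reflect each month's own sample size and spread, rather than assuming the same uncertainty for all twelve months.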

Data Normality and Homogeneity of Variances Tests:

In [27]:
month_groups = [group["rating"] for _, group in sampled_data.groupby("month")]

shapiro_p_values = [shapiro(rating)[1] for rating in month_groups]
print("Shapiro-Wilk Test p-values:", shapiro_p_values)

levene_stat, levene_p_value = levene(*month_groups)
print("Levene's Test p-value:", levene_p_value)
/tmp/ipykernel_7695/3141775468.py:3: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 17822.

(the same warning is repeated for each of the 12 monthly groups, with N ranging from 14657 to 17822)

Shapiro-Wilk Test p-values: [1.8747089767737397e-118, 7.961967357139668e-117, 2.954699448513843e-116, 1.9349365936573582e-116, 1.135387911507323e-116, 3.224140871100971e-116, 1.3108350380063052e-116, 1.4692818787365716e-117, 1.0922381339800924e-117, 6.496702124256376e-118, 1.3327205331084589e-114, 1.9856925052677519e-112]
Levene's Test p-value: 0.0001316362139861504

Statistical Hypotheses:
Null Hypothesis (H0): No specific time periods are associated with higher or lower ratings.
Alternative Hypothesis (H1): Specific time periods are associated with higher or lower ratings.

Hypothesis Testing:
To examine the relationship between review time periods and ratings, the Kruskal-Wallis test was employed. This choice was made due to concerns about the normality assumption, as indicated by the Shapiro-Wilk test results. The Kruskal-Wallis test is a non-parametric alternative to ANOVA, well suited to analyzing the effect of a single categorical factor (month) on a continuous outcome (rating) when normality is violated. The resulting test statistic of 40.14 and a p-value of 3.39e-05, well below α = 0.05, provide compelling evidence against the null hypothesis, indicating that specific time periods are indeed associated with higher or lower ratings.

In [28]:
month_groups = [group["rating"] for _, group in sampled_data.groupby("month")]
h_statistic, p_value = kruskal(*month_groups)

print(f"P-value is: {p_value}, and test statistic is: {h_statistic}")

if p_value < 0.05:
    print("There are significant differences in ratings among different months.")
else:
    print("No significant differences in ratings among different months were found.")
P-value is: 3.385299573556052e-05, and test statistic is: 40.14017461949268
There are significant differences in ratings among different months.
In [29]:
dunn_results = posthoc_dunn(
    sampled_data, val_col="rating", group_col="month", p_adjust="bonferroni"
)
print(dunn_results)
          1         2         3         4         5         6         7   \
1   1.000000  1.000000  1.000000  0.042600  0.336116  1.000000  1.000000   
2   1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   
3   1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   
4   0.042600  1.000000  1.000000  1.000000  1.000000  0.070442  0.508347   
5   0.336116  1.000000  1.000000  1.000000  1.000000  0.497301  1.000000   
6   1.000000  1.000000  1.000000  0.070442  0.497301  1.000000  1.000000   
7   1.000000  1.000000  1.000000  0.508347  1.000000  1.000000  1.000000   
8   1.000000  1.000000  1.000000  0.018038  0.159561  1.000000  1.000000   
9   1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   
10  1.000000  1.000000  1.000000  0.462842  1.000000  1.000000  1.000000   
11  1.000000  1.000000  1.000000  0.069872  0.481064  1.000000  1.000000   
12  1.000000  0.106053  0.130529  0.000034  0.000579  1.000000  0.922705   

          8         9         10        11        12  
1   1.000000  1.000000  1.000000  1.000000  1.000000  
2   1.000000  1.000000  1.000000  1.000000  0.106053  
3   1.000000  1.000000  1.000000  1.000000  0.130529  
4   0.018038  1.000000  0.462842  0.069872  0.000034  
5   0.159561  1.000000  1.000000  0.481064  0.000579  
6   1.000000  1.000000  1.000000  1.000000  1.000000  
7   1.000000  1.000000  1.000000  1.000000  0.922705  
8   1.000000  1.000000  1.000000  1.000000  1.000000  
9   1.000000  1.000000  1.000000  1.000000  0.410355  
10  1.000000  1.000000  1.000000  1.000000  0.851710  
11  1.000000  1.000000  1.000000  1.000000  1.000000  
12  1.000000  0.410355  0.851710  1.000000  1.000000  
In [30]:
cnx.close()

OUTCOMES:

  • To test the hypothesis regarding the difference in average ratings between podcast categories, the Kruskal-Wallis test was conducted. The test statistic obtained was 7045.29, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.
  • To test the hypothesis regarding the difference in rating counts between podcast categories, the chi-square test was conducted. The test statistic obtained was 7812.36, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.
  • To test the hypothesis regarding the difference in average ratings between months, the Kruskal-Wallis test was conducted. The test statistic obtained was 40.14, and the corresponding p-value was 3.39e-05. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.

INSIGHTS:

  • The results of the Kruskal-Wallis and Dunn's tests indicate that there is sufficient evidence to conclude that average ratings differ between podcast categories. Specifically, the Arts, History, Kids & Family, Leisure, Music, Others, and Religion & Spirituality categories received significantly higher ratings, while the Business & Finance, Comedy, Government, Health & Fitness, News & Politics, Society & Culture, Sports & Recreation, TV & Film, Technology, and True Crime categories received significantly lower ratings.
  • The results of the chi-square test indicate that there is sufficient evidence to conclude that rating counts differ between podcast categories.
  • Based on the Dunn's test results, April differs significantly from several months (January, August, and December), and its confidence interval places it as the highest-rated month. December shows the lowest mean rating and differs significantly from April and May.
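
As a quick sanity check on the direction of the seasonal effect, the monthly means can be ranked directly. A sketch using the midpoints of the confidence intervals reported earlier for months 1, 4, 8, and 12 (illustrative values, not a full recomputation):

```python
import pandas as pd

# Midpoints of the reported monthly CIs for a few months; the notebook's
# full values come from sampled_data.groupby("month")["rating"].mean().
monthly_means = pd.Series(
    {1: 4.624, 4: 4.666, 8: 4.632, 12: 4.615},
    name="rating",
)

ranked = monthly_means.sort_values()
print("Lowest-rated month:", ranked.index[0])   # December
print("Highest-rated month:", ranked.index[-1])  # April
```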


⇡¶

CONCLUSIONS

¶

Insights and Findings:

  1. Difference in Podcast Categories Ratings:

    • The analysis reveals significant differences in average ratings between podcast categories. Categories such as Arts, History, Kids & Family, Leisure, Music, Others, and Religion & Spirituality tend to receive higher ratings, whereas Business & Finance, Comedy, Government, Health & Fitness, News & Politics, Society & Culture, Sports & Recreation, TV & Film, Technology, and True Crime receive lower ratings on average. This highlights the importance of tailoring content and marketing strategies to audience preferences within each category.
  2. Variation in Rating Count Across Podcast Categories:

    • Notable differences in rating counts between podcast categories suggest potential disparities in audience engagement and reach. Understanding these variations can inform decisions on content creation, promotion, and audience targeting.
  3. Seasonal Trends in Podcast Ratings:

    • The monthly analysis shows variations in average ratings across months: April exhibits the highest average rating, while December exhibits the lowest. These seasonal trends can inform content scheduling and promotion strategies, helping to maximize audience engagement and satisfaction throughout the year.

¶

Recommendations for Action:

  1. Enhancing Visibility and Engagement: Given the observed differences between podcast ratings and voting counts, strategies should be developed to enhance visibility and engagement for high-quality podcasts with potentially underrepresented reach.

  2. Audience-Centric Content Improvement: Exploring audience feedback and preferences can identify areas for content improvement. Analyzing listener reviews, ratings, and engagement metrics can help understand what resonates most with the audience, enabling adjustments to content strategies for better alignment with audience preferences.

¶

Further Areas for Investigation:

  1. Factors Influencing Voting Counts: Additional investigation is needed to explore factors influencing differences in voting counts among podcasts with similar average ratings. This could involve analyzing promotion strategies, audience demographics, or platform visibility to optimize audience engagement.

  2. Understanding Variation in Ratings Across Categories: Further examination of variables contributing to variations in average ratings across podcast categories, such as content format, host expertise, or audience demographics, can provide deeper insights into audience preferences and content performance.